JFruit2 Optimisation¶

This tutorial will cover the use of black-box optimisation on the JFruit2 simulation model.

Black-box optimisation applied to the JFruit2 fruit model means searching for parameter values, such as growth rates, transpiration coefficients, or sugar transport constants, that make the model's outputs best match observed fruit data, without needing access to the internal equations or their derivatives.

The optimiser only "sees" the model's inputs and outputs, evaluating how well each trial run fits experimental or target outcomes. This approach is especially useful when the biological processes are too complex, nonlinear, or computationally expensive to model analytically, allowing calibration of JFruit2 purely through iterative evaluation and intelligent search.
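This evaluate-and-search loop can be sketched in a few lines: propose parameters, run the model as a black box, score its output against observations, and keep the best. Below is a minimal random-search sketch with a toy stand-in model; the quadratic "simulator" and its growth-rate parameter are purely illustrative and have nothing to do with JFruit2's actual equations:

```python
import numpy as np

rng = np.random.default_rng(42)

def black_box_model(growth_rate: float) -> np.ndarray:
    # Stand-in for a simulator: output over 10 time steps.
    t = np.arange(10)
    return growth_rate * t**2

observed = 0.5 * np.arange(10) ** 2  # pretend field data (true rate = 0.5)

best_loss, best_rate = np.inf, None
for _ in range(200):
    candidate = rng.uniform(0.0, 1.0)            # propose a parameter
    simulated = black_box_model(candidate)       # run the model (black box)
    loss = np.mean((observed - simulated) ** 2)  # score the fit
    if loss < best_loss:
        best_loss, best_rate = loss, candidate

print(round(best_rate, 2))
```

Smarter optimisers replace the uniform proposals with an informed search strategy, but the model stays a black box throughout.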

Imports¶

We will first import all required dependencies.

In [25]:
import pandas as pd
import numpy as np
from jfruit2 import JFruit2
import os.path as osp
import os
import shutil

from calisim.data_model import (
	DistributionModel,
	ParameterDataType,
	ParameterSpecification,
)

from calisim.optimisation import OptimisationMethod, OptimisationMethodModel
from calisim.statistics import MeanSquaredError, MeanPinballLoss

Observed data¶

We will next load the observed field data.

In [2]:
observed_data = JFruit2.get_observed_data()
observed_data
Out[2]:
age_d age_h pip piv pp pv sfstone af ctcs pif ... w s pf1 pf2 pf v dD mSol mSta mSyn
0 2 1 13.778 13.778 9.8344 9.8344 0.0 0.57664 0.000267 11.961 ... 0.033638 0.010665 0.98543 4.7757 0.98543 0.040303 -0.37917 0.000646 0.000103 0.004049
1 2 2 13.774 13.774 9.8308 9.8308 0.0 0.58355 0.000264 11.876 ... 0.034391 0.010694 0.98872 5.1422 0.98872 0.041074 -0.74167 0.000654 0.000103 0.004054
2 2 3 13.774 13.774 9.8304 9.8304 0.0 0.59076 0.000261 11.804 ... 0.035182 0.010724 0.98789 5.0852 0.98789 0.041885 -1.08540 0.000661 0.000103 0.004059
3 2 4 13.771 13.771 9.8273 9.8273 0.0 0.59815 0.000258 11.716 ... 0.036000 0.010754 0.98967 5.3017 0.98967 0.042721 -1.41250 0.000668 0.000103 0.004064
4 2 5 13.767 13.767 9.8238 9.8238 0.0 0.60613 0.000254 11.619 ... 0.036889 0.010784 0.99307 5.6945 0.99307 0.043629 -1.70830 0.000676 0.000103 0.004070
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3787 159 20 13.827 13.827 9.8834 9.8834 0.0 194.40000 0.000611 20.593 ... 190.860000 39.733000 0.94875 10.3730 0.94875 215.700000 -3046.20000 8.391700 1.092500 7.243600
3788 159 21 13.821 13.821 9.8781 9.8781 0.0 194.45000 0.000610 20.527 ... 190.940000 39.733000 0.94891 12.5310 0.94891 215.780000 -3047.00000 8.392100 1.092100 7.243600
3789 159 22 13.817 13.817 9.8737 9.8737 0.0 194.50000 0.000610 20.470 ... 191.030000 39.733000 0.94898 13.5650 0.94898 215.870000 -3047.80000 8.392600 1.091700 7.243600
3790 159 23 13.814 13.814 9.8711 9.8711 0.0 194.56000 0.000610 20.433 ... 191.130000 39.733000 0.94902 14.1340 0.94902 215.970000 -3048.50000 8.393100 1.091300 7.243600
3791 159 24 13.813 13.813 9.8693 9.8693 0.0 194.61000 0.000610 20.406 ... 191.230000 39.734000 0.94901 13.9960 0.94901 216.070000 -3049.20000 8.393800 1.090900 7.243600

3792 rows × 33 columns

Calibration procedure¶

Single-objective optimisation¶

We will next run the calibration procedures against observed fruit water mass (w) data. In this case, we will perform black-box optimisation.

We begin by specifying our parameter distributions. We will calibrate 3 parameters:

  • Growth.lp: Conductivity of the composite membrane for water transport (phloem)
  • Growth.lxAsy: Negative asymptote of the inverse logistic curve to calculate hydraulic conductivity
  • Growth.kx: Maximum slope of the inverse logistic curve to calculate hydraulic conductivity

The ranges below were identified through consultation with several scientists at both Plant and Food Research (now the Bioeconomy Science Institute) and INRAE.

In [3]:
parameter_spec = ParameterSpecification(
	parameters=[
		DistributionModel(
			name="Growth.lp",
			distribution_name="uniform",
			distribution_args=[0.001, 0.004],
			data_type=ParameterDataType.CONTINUOUS,
		),
		DistributionModel(
			name="Growth.lxAsy",
			distribution_name="uniform",
			distribution_args=[0.01, 0.015],
			data_type=ParameterDataType.CONTINUOUS,
		),
		DistributionModel(
			name="Growth.kx",
			distribution_name="uniform",
			distribution_args=[0.0019, 0.0038],
			data_type=ParameterDataType.CONTINUOUS,
		),
	]
)
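To sanity-check the search space, we can draw samples from these ranges directly. This is plain numpy rather than calisim, and it assumes that `distribution_args` gives the lower and upper bounds of each uniform prior (some libraries instead interpret the arguments as location and scale):

```python
import numpy as np

rng = np.random.default_rng(0)

# Bounds mirroring the specification above, assuming (lower, upper) semantics.
bounds = {
    "Growth.lp": (0.001, 0.004),
    "Growth.lxAsy": (0.01, 0.015),
    "Growth.kx": (0.0019, 0.0038),
}

samples = {name: rng.uniform(lo, hi, size=1000) for name, (lo, hi) in bounds.items()}
for name, draws in samples.items():
    print(f"{name}: min={draws.min():.5f}, max={draws.max():.5f}")
```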

We'll define an objective function for our optimiser, which returns the discrepancy between observed and simulated fruit water mass values, as measured by the mean squared error (MSE). We aim to maximise the predictive accuracy of JFruit2 by minimising the MSE returned by the objective function. This is a single-objective optimisation problem.

In [4]:
def objective(
    parameters: dict, simulation_id: str, observed_data: pd.DataFrame | None
) -> float | list[float]:
    # Apply the trial parameters to a fresh copy of the model properties.
    model = JFruit2()
    props = model.load_properties()
    for k in parameters:
        props[k] = parameters[k]
    model.save_properties(props)

    # Run the simulation and score the fit to the observed water mass.
    model.run(
        properties=osp.join("data", "out", f"{model.sim_id}.properties")
    )
    simulated_w = model.results.w.values
    observed_w = observed_data.w.values
    metric = MeanSquaredError()
    discrepancy = metric.calculate(observed_w, simulated_w)
    return discrepancy
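`MeanSquaredError` is calisim's metric class, but the arithmetic it performs reduces to the familiar mean of squared residuals, which we can reproduce in plain numpy:

```python
import numpy as np

observed = np.array([1.0, 2.0, 3.0])
simulated = np.array([1.5, 2.0, 2.0])

# Mean of squared residuals: (0.25 + 0.0 + 1.0) / 3
mse = np.mean((observed - simulated) ** 2)
print(mse)
```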

We'll next run our black-box optimisation procedure using the Tree-structured Parzen Estimator (TPES) from the Optuna library.

TPES is a Bayesian optimisation method that uses two density-estimation surrogate models to learn which JFruit2 parameter values lead to good and bad results, where:

  • $l(x) = p(x \mid y < y^*)$: models parameter values that produced good results
  • $g(x) = p(x \mid y \ge y^*)$: models parameter values that produced bad results

Here $x$ denotes a parameter set and $y^*$ is a performance threshold (often a quantile of the observed losses). New candidate parameter sets are chosen to maximise the ratio $l(x) / g(x)$: in other words, we want to evaluate solutions that look like past winners but are still uncertain. TPES is efficient for high-dimensional and mixed-type search spaces.
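The mechanics can be illustrated on a toy one-dimensional problem. The sketch below is a deliberately simplified caricature of the algorithm, using a fixed 25% quantile split and crude Gaussian kernel density estimates, not Optuna's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(x):
    return (x - 0.3) ** 2  # toy objective with a minimum at x = 0.3

# Past trials: parameter values and their observed losses.
xs = rng.uniform(0, 1, 50)
ys = loss(xs)

# Split trials at the gamma-quantile of the losses.
y_star = np.quantile(ys, 0.25)
good, bad = xs[ys < y_star], xs[ys >= y_star]

def kde(points, x, bw=0.1):
    # Simple (unnormalised) Gaussian kernel density estimate.
    return np.mean(np.exp(-0.5 * ((x[:, None] - points[None, :]) / bw) ** 2), axis=1)

# Propose candidates and pick the one maximising l(x) / g(x).
candidates = rng.uniform(0, 1, 200)
ratio = kde(good, candidates) / (kde(bad, candidates) + 1e-12)
proposal = candidates[np.argmax(ratio)]
print(round(proposal, 2))  # lands near the optimum at 0.3
```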

More information on TPES can be found in the Optuna documentation.

In [5]:
import optuna
optuna.logging.set_verbosity(optuna.logging.WARNING)

specification = OptimisationMethodModel(
    experiment_name="optuna_optimisation",
    parameter_spec=parameter_spec,
    observed_data=observed_data,
    method="tpes",
    output_labels=["Water Mass Discrepancy"],
    directions=[
        "minimize"
    ],
    n_jobs=1,
    n_iterations=35,
    method_kwargs=dict(n_startup_trials=10),
)

calibrator = OptimisationMethod(
    calibration_func=objective, specification=specification, engine="optuna"
)

calibrator.specify().execute().analyze()
Out[5]:
<calisim.optimisation.implementation.OptimisationMethod at 0x7faf34083c70>

The empirical distribution function plot depicts the proportion of trials at or below a given objective value. Roughly 32% of trials have an MSE below 2000.

We see diminishing improvements in MSE beyond trial 30, which is when the optimiser begins to converge.

The parallel coordinate plot shows the relationship between the objective values and our 3 parameters using horizontal curves, where darker curves indicate better performance. We see less spread in the trajectories of the curves for Growth.lp and Growth.kx, and greater spread for Growth.lxAsy.

According to the fANOVA sensitivity analysis, Growth.lp is the most important parameter. This is consistent with the results of the Sobol sensitivity analysis.

Our slice plots show the sampling history of TPES over trials. It quickly converges to a solution.

We'll extract the point estimates from our calibrator.

In [6]:
optimisation_df = pd.DataFrame({
    "parameter_name": [ 
        model.name
        for model in calibrator.get_parameter_estimates().estimates
    ],
    "parameter_estimate": [ 
        model.estimate
        for model in calibrator.get_parameter_estimates().estimates
    ]
})
optimisation_df
Out[6]:
parameter_name parameter_estimate
0 Growth.kx 0.003677
1 Growth.lp 0.001050
2 Growth.lxAsy 0.010132

We see the parameter estimates above for our 3 parameters. We can run JFruit2 again using these optimised estimates, and compare the simulated and observed fruit water mass values.

In [7]:
parameters = { 
    row["parameter_name"]: row["parameter_estimate"]
    for row in optimisation_df.to_dict("records")
}
parameters

model = JFruit2()
props = model.load_properties()
for k in parameters:
    props[k] = parameters[k]
model.save_properties(props)

model.run(
    properties = osp.join("data", "out", f"{model.sim_id}.properties")
)

pd.DataFrame({
    "observed": observed_data.w.values,
    "simulated": model.results.w.values
}).plot.scatter("observed", "simulated")
Out[7]:
<Axes: xlabel='observed', ylabel='simulated'>

After calibration via black-box optimisation, the predictive accuracy of our JFruit2 model is quite high, and the model's behaviour more closely reflects reality.

Multi-objective optimisation¶

Let's repeat this process, but this time we will incorporate a second objective. Hence, we are performing multi-objective optimisation rather than single-objective optimisation.

Let's aim to minimise both the discrepancy between simulated and observed carbon content in soluble sugars (mSol) and the discrepancy between simulated and observed fruit water mass (w). We will use the mean pinball loss metric for mSol, and the MSE for w.

We can re-use our parameter specification from single-objective optimisation, but we'll need to define a new objective function.

In [27]:
def objective(
    parameters: dict, simulation_id: str, observed_data: pd.DataFrame | None
) -> float | list[float]:
    # Apply the trial parameters to a fresh copy of the model properties.
    model = JFruit2()
    props = model.load_properties()
    for k in parameters:
        props[k] = parameters[k]
    model.save_properties(props)

    model.run(
        properties=osp.join("data", "out", f"{model.sim_id}.properties")
    )

    # Score the fit for both objectives: fruit water mass (w) and
    # carbon content in soluble sugars (mSol).
    simulated_w = model.results.w.values
    observed_w = observed_data.w.values
    simulated_mSol = model.results.mSol.values
    observed_mSol = observed_data.mSol.values

    w_metric = MeanSquaredError()
    mSol_metric = MeanPinballLoss()
    w_discrepancy = w_metric.calculate(observed_w, simulated_w)
    mSol_discrepancy = mSol_metric.calculate(observed_mSol, simulated_mSol)

    return w_discrepancy, mSol_discrepancy
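`MeanPinballLoss` is calisim's metric class; the underlying quantile (pinball) loss itself can be sketched in plain numpy. We assume the median quantile $\alpha = 0.5$ here, at which the pinball loss reduces to half the mean absolute error; calisim's default quantile level may differ:

```python
import numpy as np

def mean_pinball_loss(observed, simulated, alpha=0.5):
    # Pinball (quantile) loss: under-prediction is weighted by alpha,
    # over-prediction by (1 - alpha), giving an asymmetric penalty.
    diff = observed - simulated
    return np.mean(np.maximum(alpha * diff, (alpha - 1) * diff))

observed = np.array([1.0, 2.0, 3.0])
simulated = np.array([1.5, 2.0, 2.0])

print(mean_pinball_loss(observed, simulated))  # half the MAE at alpha = 0.5
```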

Let's run the multi-objective optimisation procedure, again using the TPES algorithm from the Optuna library.

In [29]:
specification = OptimisationMethodModel(
    experiment_name="optuna_optimisation",
    parameter_spec=parameter_spec,
    observed_data=observed_data,
    method="tpes",
    output_labels=["Water Mass Discrepancy", "Carbon Content Discrepancy"],
    directions=[
        "minimize", "minimize"
    ],
    n_jobs=1,
    n_iterations=25,
    method_kwargs=dict(n_startup_trials=10),
)

calibrator = OptimisationMethod(
    calibration_func=objective, specification=specification, engine="optuna"
)

calibrator.specify().execute().analyze()
Out[29]:
<calisim.optimisation.implementation.OptimisationMethod at 0x7faf30acae60>

We can see some useful visualisations for both our objectives. But we'll focus on the Pareto-front plot.

Single-objective optimisation yields a single best solution: one optimal parameter set. The aim of multi-objective optimisation is instead to identify the Pareto optimal set across your different objectives. Formally, a solution is Pareto optimal if no other solution can improve one objective without worsening at least one other objective. Informally, we want every objective to "win", and no objective to "lose".
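Computing the Pareto optimal set for a finite collection of trials is straightforward; here is a small self-contained sketch, where the trial values are hypothetical and both objectives are minimised:

```python
def pareto_front(points):
    # A point is Pareto optimal if no other point is at least as good in
    # both objectives (minimisation; points are assumed distinct).
    front = []
    for i, p in enumerate(points):
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1]
            for j, q in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append(p)
    return front

# Hypothetical (water mass discrepancy, carbon content discrepancy) trials.
trials = [(2.0, 8.0), (3.0, 6.0), (4.0, 7.0), (5.0, 3.0), (6.0, 4.0)]
print(pareto_front(trials))  # [(2.0, 8.0), (3.0, 6.0), (5.0, 3.0)]
```

(4.0, 7.0) is dominated by (3.0, 6.0), and (6.0, 4.0) by (5.0, 3.0); the remaining three points form the Pareto front.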

In the Pareto-front plot, we see the trade-off between the Water Mass discrepancy and Carbon Content discrepancy objectives. We could then select one parameter set from our Pareto optimal set depending on whether we wish to favour Water Mass discrepancy or Carbon Content discrepancy.